Abstract

Biological identification of transcription factor binding sites and their corresponding transcription factor target genes (TFTGs)
contributes greatly to understanding gene regulatory networks. However, these approaches rely on
laborious and time-consuming biological experiments. Numerous computational approaches have shown great potential to
circumvent laborious biological methods. However, the majority of these algorithms provide limited performance and fail
to consider the structural property of the datasets. We proposed a refined systematic computational approach for
predicting TFTGs. Based on previous work done on identifying auxin response factor target genes from Arabidopsis thaliana 
co-expression data, we adopted a novel reverse-complementary distance-sensitive n-gram profile algorithm. This algorithm 
converts each upstream sub-sequence into a high-dimensional vector data point and transforms the prediction task into a 
classification problem using a support vector machine-based classifier. Our approach showed significant improvement
compared to other computational methods based on the area under curve value of the receiver operating characteristic 
curve using 10-fold cross validation. In addition, in the light of the highly skewed structure of the dataset, we also evaluated 
other metrics and their associated curves, such as precision-recall curves and cost curves, which provided highly satisfactory 
results. 
Introduction

Unraveling the gene regulatory networks is regarded as one of 
the fundamental problems challenging biologists [1]. Gene 
expression is systematically controlled by regulatory proteins 
known as transcription factors (TFs) that bind to specific cognate 
DNA sites known as transcription factor binding sites (TFBSs). 
Through interacting with other cis-elements, these TFs can
function as repressors preventing transcription by inhibiting the 
activity of RNA polymerization complex either by directly binding 
to TFBSs or indirectly modifying transcription factor target genes 
(TFTGs). Transcription factors can also function as activators, 
which promote the expression of TFTGs. In addition to post- 
transcriptional gene regulations, there are also post-translational 
gene modification regulations, including biochemical alteration 
and RNA interference [2,3]. However, the interplay among the 
corresponding TFs, TFBSs, and TFTGs remains the predominant 
mechanism governing the gene regulatory processes. 

In order to circumvent the laborious biological experiments for 
screening TFBSs and their corresponding TFTGs, a number of 
computational algorithms have been proposed in the last decade 
on the basis of pre-established biological results [4-16]. Instead of 
directly searching for TFTGs, the majority of these algorithms 
focused on the nucleotide sequence information to screen potential 
TFBSs and ignored the structural property of DNA molecules. 



Local search-based algorithms, such as Gibbs sampling, have been 
applied on certain microorganisms with some success but lacked 
global optimality [4-7]. Position weight matrix-based approaches
were popular but suffered greatly from high false-positive
prediction rate and the independence assumption among different
TFBSs [8-10]. Most recently, He et al. [11] refined the traditional
n-gram profile algorithm based on the fact that a specific TF may 
bind on either a DNA strand or its reverse complement and 
produced satisfactory results. Following their work, Dai et al. [12] 
incorporated a positional signal into each potential TFBS and 
greatly improved prediction performances. Additionally, Meysman
et al. [13] discussed a prediction algorithm using DNA
structural information alone to predict TFBSs. De novo methodology-based
predictions did not require any model training based
on the prior knowledge of TFTGs, thus showing their advantages in
terms of computational cost and classification accuracy [14-16].
Unfortunately, all these approaches provided limited classification 
performances and failed to consider the dataset structure when 
interpreting the final results. Particularly, the method proposed by 
Dai et al. [12], which requires arbitrarily choosing thresholds for 
feature selection, could provide limited performances because the 
optimal threshold was not identified. Taking all these weaknesses 
into account, an improved new systematic computational 
approach for TFTG prediction was proposed in our study and 
produced great results. 
Materials and Methods

Using the well-documented information domain of the corre- 
sponding TFs, TFBSs, and TFTGs, we constructed a binary 
classifier based on support vector machines (SVM). A standard 
feature extraction, feature selection, model construction, and 
dataset testing paradigm was followed. The feature extraction 
region was limited to within 1000-bp upstream from the 
transcription start point. This frame was verified to contain the
largest number of TFBSs in previous biological studies [11,12].
Once these 1000-bp sequences with identified class labels (TFTGs
or non-TFTGs) were generated, they were then profiled by a new
reverse-complementary distance-sensitive n-gram profiling
(RCDSNGP) algorithm designed to better capture the patterns
of potential conserved motifs and their corresponding positions
relative to recognized TFBSs. For feature selection, we adopted
Monte Carlo simulation based on information gain (IG) measure- 
ments to select features that have a p-value smaller than 0.01. 
Finally, each upstream sequence of either TFTGs or non-TFTGs 
was represented by a single data point in a multi-dimensional 
feature space and was later fed to SVM to build prediction models. 

Feature extraction, selection, model training, and testing were 
performed on the basis of a 10-fold cross validation (10-fold CV).
That is, the entire dataset is randomly and evenly split into 10 
disjoint subsets. Each subset contains a proportion of TFTG 
sequences similar to that of the pre-split dataset, e.g. 19 TFTGs 
+260 non-TFTGs = 279. The final result is calculated from a 
composite of 10 trials. Within each trial, a different subset of 
samples is selected for testing and the other nine subsets of samples 
are used for training. Feature extraction and selection are 
performed within the nine training subsets during each trial, thus, 
selected features are different in each trial. 
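The splitting scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `stratified_folds`, the round-robin dealing strategy, and the fixed random seed are assumptions introduced here. Only the class sizes (186 TFTGs, 2601 non-TFTGs) and the 10-fold design come from the text.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k disjoint folds with similar class ratios.

    Hypothetical helper: indices of each class are shuffled and then dealt
    round-robin, so every fold receives a near-equal share of each class.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Class sizes taken from the text: 186 TFTGs (+1) and 2601 non-TFTGs (-1).
labels = [1] * 186 + [-1] * 2601
folds = stratified_folds(labels, k=10)

for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    # Feature extraction, feature selection, and model training would use
    # `train` only; `test_fold` is held out for evaluation in this trial.
```

Keeping feature selection inside the loop, as the text specifies, prevents information from the held-out fold from leaking into the selected feature set.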

Datasets 

The procedures for generating sequence datasets were described 
previously by Dai et al. [12]. In general, auxin response factors 
(ARFs) regulate their target genes by recognizing the primary 
conserved motif 'TGTCTC' or its reverse complement 'GAGACA'
in the upstream region [17]. However, the presence of this
conserved motif by itself may not guarantee that the corresponding 
sequence belongs to TFTGs. Goda et al. [18] used Affymetrix 
Genechip to investigate the gene expressions of A. thaliana treated 
with auxin and brassinosteroid and reported that only 186 out of 
2787 genes containing the conserved motif ('TGTCTC' or
'GAGACA') in their 1000-bp upstream region were TFTGs.
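As a small illustration of the motif relationship described above, the following sketch checks that 'GAGACA' is the reverse complement of 'TGTCTC' and screens a sequence for either form. The helper names `reverse_complement` and `has_motif` are hypothetical and chosen only for this example.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def has_motif(upstream, motif="TGTCTC"):
    """True if the motif or its reverse complement occurs in the sequence."""
    return motif in upstream or reverse_complement(motif) in upstream

motif_rc = reverse_complement("TGTCTC")
found = has_motif("AAGCTTGAGACACAGCT")
```

As the text notes, the presence of the motif alone does not identify a TFTG; such a screen only defines the candidate set of sequences.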

By referring to the accession IDs verified by Goda et al. [18] and 
location information (transcription start point and chromosome 
ID) obtained from TAIR6 Arabidopsis Information Resource
(ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR6_genome_release),
the 1000-bp gene upstream sequences
of A. thaliana with the conserved motif (186 TFTGs + 2601 non-TFTGs
= 2787) were extracted from the genome sequences
(ftp://ftp.arabidopsis.org/home/tair/home/tair/Sequences/whole_chromosomes).
The entire dataset can be downloaded from
the Samuel Roberts Noble Foundation online supplementary data
source (available via
http://bioinfo.noble.org/manuscript-support/TF_Supp/dataset/).

Feature extraction 

Each upstream sequence has to be converted into a series of 
numerical values corresponding to its coordinates in a high-order 
feature space for SVM training and prediction purposes. An n-gram
profiling algorithm was used previously to represent a
sequence stream by a set of n continuous characters and their
corresponding frequencies [19]. This approach is analogous to the
k-mer approach used in other gene sequence studies [20].

Because of the double helix structure and base-pairing property 
of DNA, TFs may bind on either strand of a DNA molecule. Thus, 
a conserved motif and its reverse complement should be treated 
equally for each TF. In the light of this, He et al. [11] proposed the
reverse-complementary n-gram profile (RCNP) algorithm, formalized
as follows.

Definition 1 (RCNP): Given an m-length sequence $s = s_1 s_2 \ldots s_m$,
the RCNP of $s$ is a set of $K$ 2-tuples, denoted as
$\mathrm{RCNP}(s) = \{(\{f_1, r_1\}, c_1), (\{f_2, r_2\}, c_2), \ldots, (\{f_K, r_K\}, c_K)\}$,
$f_k$ being a distinct n-gram, $r_k$ being the reverse complement of $f_k$, and $c_k$ being the sum of
frequency counts of $f_k$ and $r_k$ in $s$. Additionally, $\{f_k, r_k\}$ ($k = 1, 2, \ldots, K$)
includes all possible combinations of an n-gram and its reverse
complement in $s$.
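Definition 1 can be sketched in a few lines. This is an illustrative reading of the definition rather than the code of He et al.: the function name `rcnp` and the choice to key each feature by the alphabetically ordered pair (n-gram, reverse complement) are assumptions made here so that both strands map to the same feature.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def rcnp(seq, n):
    """Reverse-complementary n-gram profile of seq (sketch of Definition 1).

    Every n-gram occurrence adds one count to the feature {f, r}, where r is
    the reverse complement of f; the pair is stored in sorted order so an
    n-gram and its reverse complement share one key.
    """
    profile = Counter()
    for t in range(len(seq) - n + 1):
        g = seq[t:t + n]
        r = reverse_complement(g)
        profile[(min(g, r), max(g, r))] += 1
    return profile

profile = rcnp("AAGCTT", 4)
```

In this toy sequence, AAGC and GCTT are reverse complements of each other and therefore accumulate in a single feature, while the palindromic AGCT is counted once per occurrence.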

The essence of RCNP was that an occurrence of either an n-gram
or its reverse complement will be counted equally as one
increment of that feature ($\{f_k, r_k\}$). In addition, by limiting the
feature extraction region within a finite window evenly neighboring
the center motif (such as 100-bp window size with 50 bp on
each flank), He et al. [11] considered the presence of other possible
synergic TFBSs within the window beside the center motif. This
approach was based on the assumption that the closer an n-gram is
to the primary TFBSs, the stronger its influence on regulating TF
binding processes. An optimal area under curve (AUC) value of
0.8949 was obtained on a similar dataset using this RCNP
algorithm [11].

Immediately after He et al.'s work, Dai et al. [12] expanded the
RCNP algorithm into a reverse-complementary position-sensitive
n-gram profile (RCPSNP) algorithm by incorporating a positional
information parameter and a position-sensitive parameter into the
RCNP. The position-sensitive parameter was introduced mainly to
account for the possibility that two identical n-grams extracted
from a certain window flanking the center motif on the same DNA
strand may have similar impacts on regulating TF binding
processes regardless of their positional differences. This feature
generation scheme yielded an AUC value of 0.73 for the receiver
operating characteristic (ROC) curve [12].

In this study, we propose an improved feature generation
algorithm. Studies have shown the existence of a composite
structure containing constitutive elements adjacent to the
'TGTCTC' binding site for ARFs [17,21]. As a result, the auxin
inducibility was likely affected incrementally by multiple elements.
In addition, their contribution differences should be related to
the distance from each element to the primary TFBS. None of the
previous studies investigated the impact differences between the
upstream-region elements and the downstream-region elements
around the primary binding sites. Therefore, it was not logical to
incorporate each n-gram with a signed integer representing its
direction and distance relative to the primary TFBS as proposed
by Dai et al. [12]. Considering all factors described above, we
introduced the reverse-complementary distance-sensitive n-gram
profile (RCDSNGP), formalized as follows.

Definition 2 (RCDSNGP): Given an m-length sequence
$s = s_1 s_2 \ldots s_i \ldots s_{i+j} \ldots s_m$, the RCDSNGP of $s$ with respect to a j-length
reference subsequence $x = s_i \ldots s_{i+j-1}$ is a set of $K$ 2-tuples, denoted
as $\mathrm{RCDSNGP}(s) = \{(\{f_1, r_1, d_1\}, c_1), (\{f_2, r_2, d_2\}, c_2), \ldots, (\{f_K, r_K, d_K\}, c_K)\}$,
$f_k$ being a distinct n-gram, $r_k$ being the reverse complement of
$f_k$, $d_k$ being the relative distance parameter, and $c_k$ being the sum of
frequency counts of $f_k$ and $r_k$ with the same $d_k$ relative to $x$ in $s$.
Additionally, $\{f_k, r_k, d_k\}$ ($k = 1, 2, \ldots, K$) includes all possible
combinations of an n-gram, its reverse complement, and its
distance to $x$ in $s$.
If we denote either $f_k$ or $r_k$ as an n-gram, $g = s_t s_{t+1} \ldots s_{t+n-1}$
($n-1 < t+n-1 < i$ or $m-n+2 > t > i+j-1$), then its relative distance to
$x$ is calculated as follows.

$$
d_k = \begin{cases}
t - (i+j-1), & \text{if } (m-n+2) > t > (i+j-1) \\
i - (t+n-1), & \text{if } (n-1) < (t+n-1) < i
\end{cases}
\qquad (1)
$$



where $N_{TG}$ and $N_{\overline{TG}}$ are the numbers of sequences in $S$
belonging to TFTGs and non-TFTGs, respectively.

After observing a specific feature $f$, we can partition the original
set $S$ into two distinct subsets: $S_f$, a set of upstream sequences
containing $f$, and $S_{\bar{f}}$, a set of upstream sequences without $f$. Thus,
$S = \{S_f, S_{\bar{f}}\}$ and the number of classes is $y = 2$. The entropy of $S$
with respect to $f$ is evaluated as



Each set ($\{f_k, r_k, d_k\}$) within a 2-tuple of an RCDSNGP is a
reverse-complementary distance-sensitive n-gram (RCDSNG),
synonymously a feature in our study.

By adopting RCDSNGP, an occurrence of either an n-gram or
its reverse complement with the same distance to the central TFBS
will be counted equally as one increment of that feature ($\{f_k, r_k, d_k\}$).

Table 1 demonstrates an example of an RCDSNGP of a given 
sequence with respect to a designated reference subsequence. 
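The profile construction can be sketched as below, using the example sequence AAGCTTGAGACACAGCT with GAGACA as the reference subsequence. This is a minimal sketch under stated assumptions, not the authors' code: the function name `rcdsngp`, its parameter names, the optional `d_max` cut-off argument, and the sorted-pair feature key are all introduced here for illustration. The distance rule follows the two cases of the relative-distance formula, with n-grams overlapping the reference skipped.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def rcdsngp(seq, ref_start, ref_len, n_values, d_max=None):
    """Distance-sensitive reverse-complementary n-gram profile (sketch).

    ref_start is the 1-based index i of the reference subsequence and
    ref_len is its length j, matching the notation of Definition 2.
    """
    i, j, m = ref_start, ref_len, len(seq)
    profile = Counter()
    for n in n_values:
        for t in range(1, m - n + 2):      # 1-based start of each n-gram
            end = t + n - 1
            if end < i:                    # n-gram lies before the reference
                d = i - end
            elif t > i + j - 1:            # n-gram lies after the reference
                d = t - (i + j - 1)
            else:                          # overlaps the reference: skipped
                continue
            if d_max is not None and d > d_max:
                continue
            g = seq[t - 1:end]
            r = reverse_complement(g)
            profile[(min(g, r), max(g, r), d)] += 1
    return profile

seq = "AAGCTTGAGACACAGCT"
ref_start = seq.index("GAGACA") + 1        # 1-based position of the motif
profile = rcdsngp(seq, ref_start, 6, n_values=(4, 5, 6))
```

Note that the distance is unsigned, so AGCT two positions before the motif and AGCT two positions after it fall into the same feature, which is the design choice that distinguishes this profile from a signed positional encoding.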

Feature selection 

We used n-grams of n = 4-9 for profiling each upstream
sequence using the RCDSNGP algorithm because n = 4-9 were
verified (data not shown) to give optimal performances and can
be handled efficiently in a moderate computing environment.
Given the maximum distance d_max, a total number of d_max × 4^n
features at most can be generated for each n, which is less than half
of the number produced by Dai et al. [12].

Given a d_max, 10 trials were conducted to generate the final
result (10-fold CV). Within each trial, a separate IG-based feature
ranking was used on the nine training subsets [22]. The IG
measure is based on information theory [23], which calculates the
entropy differences before and after observing a specific feature.
The entropy of the set $S$, which contained e.g. $N$ = 2509 (168
TFTGs + 2341 non-TFTGs = 2509) upstream sequences and two
distinct class labels $T = \{TG, \overline{TG}\}$ for the TFTGs and non-TFTGs
(number of classes $y$ = 2), can be calculated as



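$$
\mathrm{Entropy}(S) = -\sum_{x=1}^{y} P(T_x, S) \times \log(P(T_x, S))
= -\frac{N_{TG}}{N} \times \log\left(\frac{N_{TG}}{N}\right) - \frac{N_{\overline{TG}}}{N} \times \log\left(\frac{N_{\overline{TG}}}{N}\right)
\qquad (2)
$$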



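$$
\mathrm{Entropy}(S|f) = \sum_{x=1}^{y} P(S_x, S) \times \mathrm{Entropy}(S_x)
= -\frac{N_f}{N} \times \left( \frac{N_{TGf}}{N_f} \times \log\left(\frac{N_{TGf}}{N_f}\right) + \frac{N_{\overline{TG}f}}{N_f} \times \log\left(\frac{N_{\overline{TG}f}}{N_f}\right) \right)
- \frac{N_{\bar{f}}}{N} \times \left( \frac{N_{TG\bar{f}}}{N_{\bar{f}}} \times \log\left(\frac{N_{TG\bar{f}}}{N_{\bar{f}}}\right) + \frac{N_{\overline{TG}\bar{f}}}{N_{\bar{f}}} \times \log\left(\frac{N_{\overline{TG}\bar{f}}}{N_{\bar{f}}}\right) \right)
\qquad (3)
$$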
where numbers of upstream sequences with at least one
occurrence of feature $f$ and no occurrence of $f$ are denoted by
$N_f$ and $N_{\bar{f}}$, respectively; numbers of upstream sequences
belonging to TFTGs with at least one occurrence of feature $f$
and no occurrence of $f$ are denoted by $N_{TGf}$ and $N_{TG\bar{f}}$,
respectively; and $N_{\overline{TG}f}$ and $N_{\overline{TG}\bar{f}}$ represent the corresponding
numbers of non-TFTGs. Finally, the IG obtained by dividing $S$
according to $f$ is calculated using:



$$
IG(f) = \mathrm{Entropy}(S) - \mathrm{Entropy}(S|f) \qquad (4)
$$



and a higher IG value implies greater importance of a given 
feature for representing a specific sequence. 
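The information gain of a single presence/absence feature can be computed as in the sketch below. The function names and the use of base-2 logarithms are assumptions made for illustration; the text does not state the logarithm base, and a different base only rescales all IG values by a constant.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2, an assumption) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(has_feature, labels):
    """IG(f) = Entropy(S) - Entropy(S|f) for a presence/absence feature.

    has_feature[i] is True when sequence i contains f at least once;
    labels[i] is 1 for a TFTG and -1 for a non-TFTG.
    """
    n = len(labels)

    def class_counts(indices):
        pos = sum(1 for i in indices if labels[i] == 1)
        return [pos, len(indices) - pos]

    with_f = [i for i in range(n) if has_feature[i]]
    without_f = [i for i in range(n) if not has_feature[i]]
    h_s = entropy(class_counts(range(n)))
    h_cond = sum(len(part) / n * entropy(class_counts(part))
                 for part in (with_f, without_f) if part)
    return h_s - h_cond

labels = [1, 1, -1, -1]
ig_perfect = information_gain([True, True, False, False], labels)
ig_useless = information_gain([True, False, True, False], labels)
```

A feature that separates the classes perfectly recovers the full entropy of the label set, whereas a feature distributed identically in both classes yields an IG of zero.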

Instead of ranking features based on IGs and subjectively 
choosing the cutoff value, we further evaluated each feature using 
Monte Carlo simulations [24,25]. This approach provides 
information on whether we can statistically differentiate samples 
from two classes on a basis of a given feature. For each feature, we
shuffled class labels (TFTGs or non-TFTGs) 10 000 times without
changing either the feature count in each sequence or the total
number of sequences in each class. A new IG was calculated for
each shuffling; thus, 10 000 IGs were obtained for each feature.



Table 1. An example of a reverse-complementary distance-sensitive n-gram profile (RCDSNGP) representation with n = 4, 5, and 6
for a given sequence (AAGCTTGAGACACAGCT) with respect to the reference subsequence GAGACA*.

| Length of n-gram (n) | Reverse-complementary distance-sensitive n-gram (RCDSNG), or feature | Frequency count |
|---|---|---|
| n = 4 | {AAGC, GCTT, 1} | 1 |
| | {AGCT, AGCT, 2} | 2 |
| | {AAGC, GCTT, 3} | 1 |
| | {CAGC, GCTG, 1} | 1 |
| n = 5 | {AAGCT, AGCTT, 1} | 1 |
| | {AAGCT, AGCTT, 2} | 1 |
| | {AGCTG, CAGCT, 1} | 1 |
| n = 6 | {AAGCTT, AAGCTT, 1} | 1 |

*Given an m-length sequence $s = s_1 s_2 \ldots s_i \ldots s_{i+j} \ldots s_m$, the RCDSNGP of $s$ with respect to a j-length reference subsequence $x = s_i \ldots s_{i+j-1}$ is a set of $K$ 2-tuples, denoted as
$\mathrm{RCDSNGP}(s) = \{(\{f_1, r_1, d_1\}, c_1), (\{f_2, r_2, d_2\}, c_2), \ldots, (\{f_K, r_K, d_K\}, c_K)\}$, $f_k$ being a distinct n-gram, $r_k$ being the reverse complement of $f_k$, $d_k$ being the relative distance
parameter, and $c_k$ being the sum of frequency counts of $f_k$ and $r_k$ with the same $d_k$ relative to $x$ in $s$. Each set in a 2-tuple ($\{f_k, r_k, d_k\}$) is a reverse-complementary distance-sensitive
n-gram (RCDSNG), or a feature in our study. This RCDSNGP representation was adopted for all training sequences. For testing processes, each sequence was
converted to RCDSNGP first, and then represented according to the selected RCDSNGs generated from the training datasets, including those with zero count.
doi:10.1371/journal.pone.0094519.t001
The p-value for a specific feature was calculated according to

$$
p\text{-value} = \frac{N_{ge}}{N} \qquad (5)
$$

where $N_{ge}$ represents the number of shufflings that gave new IG
values greater than or equal to the original one and $N$ is the total
number of shufflings ($N$ = 10 000). The smaller the p-value is, the
stronger the contribution the feature would provide to differentiate
a sample between two classes. In our study, features were
considered important at the p < 0.01 level.
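The label-shuffling procedure can be sketched as follows. This is an illustrative implementation under assumptions: the function names, the base-2 logarithm, the fixed seed, and the toy data are all introduced here, and the demonstration uses 1000 shuffles on 100 samples rather than the 10 000 shuffles used in the study so that it runs quickly.

```python
import math
import random

def _entropy(pos, total):
    """Binary entropy (base 2, an assumption) from a positive count."""
    h = 0.0
    for c in (pos, total - pos):
        if c > 0:
            h -= c / total * math.log2(c / total)
    return h

def info_gain(has_feature, labels):
    """IG of a presence/absence feature; labels are 1 (TFTG) or -1."""
    n = len(labels)
    n_f = sum(1 for h in has_feature if h)
    pos = sum(1 for y in labels if y == 1)
    pos_f = sum(1 for h, y in zip(has_feature, labels) if h and y == 1)
    gain = _entropy(pos, n)
    if n_f:
        gain -= n_f / n * _entropy(pos_f, n_f)
    if n - n_f:
        gain -= (n - n_f) / n * _entropy(pos - pos_f, n - n_f)
    return gain

def permutation_p_value(has_feature, labels, n_shuffles=10000, seed=0):
    """Fraction of label shufflings whose IG is >= the observed IG."""
    rng = random.Random(seed)
    observed = info_gain(has_feature, labels)
    shuffled = list(labels)            # class sizes stay fixed when shuffling
    n_ge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if info_gain(has_feature, shuffled) >= observed:
            n_ge += 1
    return n_ge / n_shuffles

# Toy data (hypothetical): 10 positives and 90 negatives.
labels = [1] * 10 + [-1] * 90
p_strong = permutation_p_value([True] * 10 + [False] * 90, labels, 1000)
p_none = permutation_p_value([True] * 100, labels, 1000)
```

With a finite number of shuffles, the smallest non-zero p-value that can be reported is 1/N, so a value of 0 means that no shuffling matched or exceeded the observed IG.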

Data representation 

Within each trial, we selected F features whose p-values were
smaller than 0.01 from the total of 493 781 possible features
(number of RCDSNGs obtained from all sequences). Each 1000-bp
upstream sequence was represented by an RCDSNGP
consisting of these F features and their corresponding occurrences.
The whole set of N sequences (e.g. N = 168 TFTGs + 2341 non-TFTGs
= 2509) was then represented by an N × (F+1) matrix (1
extra column for class labels).
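The matrix layout can be illustrated with a short sketch. The function name `build_matrix` and the dictionary-based profile format are assumptions for this example; the only properties taken from the text are that each row holds the counts of the F selected features, including zeros, followed by one class-label column.

```python
def build_matrix(profiles, labels, selected_features):
    """Return an N x (F+1) list-of-lists matrix: F counts plus a label.

    profiles[i] maps a feature (n-gram, reverse complement, distance) to
    its count in sequence i; features absent from a sequence count as zero.
    """
    rows = []
    for profile, label in zip(profiles, labels):
        counts = [profile.get(feature, 0) for feature in selected_features]
        rows.append(counts + [label])
    return rows

# Hypothetical selected features and two toy sequence profiles.
selected = [("AAGC", "GCTT", 1), ("CAGC", "GCTG", 1)]
profiles = [{("AAGC", "GCTT", 1): 2, ("AGCT", "AGCT", 2): 1}, {}]
matrix = build_matrix(profiles, [1, -1], selected)
```

Features that were not selected on the training folds, such as the AGCT feature in the first toy profile, are simply dropped from the representation.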

Training and testing 

Support vector machine-based approaches have been widely 
used in various problem domains, such as bioinformatics [26-28] . 
They often outperformed many other classification algorithms 
[29]. We adopted SVM in our study using the LIBSVM [30], 
which is based on sequential minimal optimization. The general 
concept of SVM is given below. 

For model training, given a set of vector-label pairs
$(X_i, y_i)$, $i = 1, 2, \ldots, N$, where $N$ is equal to the number of upstream
sequences (e.g. $N$ = 2509); $X_i \in R^n$, where $n$ is the dimension of $X_i$,
equal to the number of selected features; $y_i \in \{1, -1\}$ where 1
corresponds to TFTGs and -1 corresponds to non-TFTGs, the
support vector machine computes the solution to the optimization
problem formalized below:

$$
\min_{W, b, \varepsilon} \; \frac{1}{2} W^T W + C \sum_{i=1}^{N} \varepsilon_i \qquad (6)
$$

$$
\text{subject to } y_i (W^T \Phi(X_i) + b) \geq 1 - \varepsilon_i \text{ and } \varepsilon_i \geq 0.
$$



This is equivalent to finding the maximum-margin hyperplane 
that separates TFTG samples (labeled as 1) and non-TFTG 
samples (labeled as -1) with minimized measures of errors. 
Predictions are made based on the geometric location of an 
unknown sample when fed into the model. A label is assigned to a 
sample according to which side of the hyperplane it resides. In 
order to obtain better classification accuracies, data are often 
projected into a high-dimension feature space with a kernel 
function. We evaluated other possible kernels in addition to the 
linear kernel suggested by Dai et al. [12]. Particularly, we 
performed a grid (factorial) search [28] for an optimal combina- 
tion of a penalty factor C of SVM and a kernel width a for the 
Gaussian radial basis function (RBF) kernel. 
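The grid search over the penalty factor and the kernel width can be sketched generically. Everything named here is an assumption for illustration: the width-based kernel form exp(-||x - z||^2 / (2 width^2)), the exponentially spaced candidate grids, and the `score_fn` callback, which stands in for whatever cross-validated score is computed on the training folds. LIBSVM itself parametrizes the RBF kernel as exp(-gamma ||x - z||^2), which corresponds to gamma = 1 / (2 width^2) under the form used below.

```python
import itertools
import math

def rbf_kernel(x, z, width):
    """Gaussian RBF kernel value for two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def grid_search(score_fn, c_values, width_values):
    """Evaluate every (C, width) pair and return (best_score, C, width).

    score_fn(C, width) is a hypothetical callback returning a score to
    maximize, for example a cross-validated AUC on the training data.
    """
    best = None
    for c, width in itertools.product(c_values, width_values):
        score = score_fn(c, width)
        if best is None or score > best[0]:
            best = (score, c, width)
    return best

# Exponentially spaced candidate grids (an assumption, not from the text).
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
width_grid = [2.0 ** e for e in range(-3, 4)]

# Toy score surface with its maximum at C = 8 and width = 0.5.
def toy_score(c, width):
    return -((math.log2(c) - 3) ** 2 + (math.log2(width) + 1) ** 2)

best_score, best_c, best_width = grid_search(toy_score, c_grid, width_grid)
```

In a real run the toy score would be replaced by training and validating an SVM for each parameter pair, which is why the search is repeated within every cross-validation trial.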

Performance measure 

First, we measured the traditional accuracy defined as

$$
\mathrm{ACCURACY} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)
$$

where all parameters are defined in Table 2. Accuracy provides a
direct and simple way of evaluating performances; however, it is
highly sensitive to data distribution [29]. In our study, if a classifier
predicts every TFTG sequence as a non-TFTG sequence, we can
still obtain an accuracy of 0.9333 due to the fact that more than 93
percent of the sequences belong to non-TFTGs. Thus, other
evaluation metrics are warranted, including precision, recall, and
F1 [29] defined as follows.



$$
\mathrm{PRECISION} = \frac{TP}{TP + FP} \qquad (8)
$$

$$
\mathrm{RECALL} = \frac{TP}{TP + FN} \qquad (9)
$$

$$
F_1 = \frac{2 \times TP}{TP + FN + TP + FP} \qquad (10)
$$
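The four metrics can be checked numerically with the sketch below. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here. The first call reproduces the all-negative baseline discussed above for 186 TFTGs and 2601 non-TFTGs. The second call uses confusion-matrix counts that are not stated in the source: they were back-calculated here as the integer counts consistent with the metrics reported for d_max = 150 and should be read as an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from confusion counts.

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (tp + fn + tp + fp) if (2 * tp + fn + fp) else 0.0
    return accuracy, precision, recall, f1

# A classifier that labels every sequence as non-TFTG.
baseline = classification_metrics(tp=0, fp=0, fn=186, tn=2601)

# Back-calculated counts (assumption) for the d_max = 150 setting.
d150 = classification_metrics(tp=94, fp=19, fn=92, tn=2582)
```

The baseline shows why accuracy alone is misleading on this dataset: it scores about 0.93 while recalling none of the TFTGs.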



Furthermore, we adopted the ROC curve technique, which 
demonstrated satisfactory performances across different classifiers 
and datasets from previous studies [12]. Some recent studies 
argued that the ROC approach tends to provide an over- 
optimistic evaluation on highly imbalanced datasets [30] . Thus, we 
also examined precision-recall (PR) curves, which appeared more 
informative and valid than ROC curves on skewed datasets [31]. 
The AUC value was calculated on both the ROC curve (ROC- 
AUC) and the PR curve (PR-AUC) to summarize the classification 
results. To better visualize the misclassification costs and statistical 
significances, cost curves [32] were also evaluated. 
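Both summary values can be estimated directly from classifier scores, as in the sketch below. The estimators are assumptions rather than the authors' procedure: ROC-AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count one half), and PR-AUC is approximated by average precision, one common step-wise estimate of the area under the PR curve that breaks score ties by sort order.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / n_pos

# Toy decision scores (hypothetical); labels use 1 for TFTG, -1 otherwise.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, -1, 1, -1]
auc_roc = roc_auc(scores, labels)
auc_pr = average_precision(scores, labels)
```

For a classifier with uninformative scores, the expected precision equals the fraction of positive samples (186/2787, roughly 0.067, on this dataset), which is why PR-AUC is a stricter summary than ROC-AUC when positives are rare.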

Dai et al. [12] reported that a window size of 200 bp (100 bp on
each flank around the central TFBS, 'TGTCTC' or 'GAGACA')
provided the best ROC-AUC value. In our study, a new feature
generation scheme was adopted. Therefore, we evaluated different
maximum distances (d_max = half of the window size) on each flank
of the central TFBSs, including d_max = 25, 50, 75, 100, 125, 150,
175, and 200 bp.

To summarize, we evaluated eight different d_max settings within
each trial on the basis of 10-fold CV. Within each d_max setting
during each trial, an IG-based p-value selection procedure was
adopted to select the important (p < 0.01) features based on the
training dataset (nine folds out of 10). Immediately following that,
we located the optimal combination of C for SVM and a for the
RBF kernel using the grid (factorial) search. Thus, the optimal
combination of features, C, and a depended on the training
dataset within each trial. The final results, including accuracy,
precision, recall, and F1 as well as the corresponding AUC values



Table 2. Confusion matrix for performance evaluation with
positive class label (+1) denoting transcription factor target
gene (TFTG) and negative class label (-1) denoting non-TFTG.

| | True class label +1 | True class label -1 |
|---|---|---|
| Predicted class label +1 | TP | FP |
| Predicted class label -1 | FN | TN |

TP, FP, FN, and TN denote true positive, false positive, false negative, and true
negative, respectively.
doi:10.1371/journal.pone.0094519.t002
Table 3. Number of unique features (union of selected features with p-value < 0.01 based on 10-fold cross validation) and
classification performances [evaluated as accuracy, precision, recall, F1, area under the curve (AUC) value of receiver operating
characteristic (ROC) curve, and AUC value of precision-recall (PR) curve] affected by the maximum distance on each flank from the
central binding site (d_max) based on 10-fold cross-validation on transcription factor target gene prediction using
reverse-complementary distance-sensitive n-gram profile algorithm with n = 4-9 and support vector machine with Gaussian radial basis
function kernel.

| d_max | Unique features | Accuracy | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|
| 25 | 893 | 0.9476 | 0.6923 | 0.3871 | 0.4966 | 0.7202 | 0.4180 |
| 50 | 1870 | 0.9541 | 0.7544 | 0.4624 | 0.5733 | 0.7562 | 0.5286 |
| 75 | 2732 | 0.9559 | 0.7739 | 0.4785 | 0.5914 | 0.7739 | 0.5554 |
| 100 | 3502 | 0.9566 | 0.7519 | 0.5215 | 0.6159 | 0.7720 | 0.5808 |
| 125 | 4255 | 0.9580 | 0.7899 | 0.5054 | 0.6164 | 0.7640 | 0.5646 |
| 150 | 4949 | 0.9602 | 0.8319 | 0.5054 | 0.6288 | 0.7626 | 0.5690 |
| 175 | 5622 | 0.9587 | 0.8034 | 0.5054 | 0.6205 | 0.7673 | 0.5639 |
| 200 | 6136 | 0.9580 | 0.7805 | 0.5161 | 0.6214 | 0.7664 | 0.5879 |

doi:10.1371/journal.pone.0094519.t003



were generated from the combined classification results containing 
all 2787 sequences according to different d max settings. 

Results 

Extraction of features from sequences 

Using a maximum distance d_max = 150 and n = 4-9 for building
RCDSNGP, 2 455 252 unique features were generated from the
1000-bp upstream regions of 2787 sequences (186 TFTGs + 2601
non-TFTGs). We eliminated singleton features that contain only 1
occurrence across all sequences to reduce the feature size down to
735 624. The same technique was applied to all other settings of
d_max as well.

Selection of representative features 

Information gain calculated for each feature was used in Monte
Carlo simulation to generate a p-value for that feature. We selected
those features that had p-values smaller than 0.01. For different
maximum distances, d_max = 25, 50, 75, 100, 125, 150, 175, and
200, we selected 893, 1870, 2732, 3502, 4255, 4949, 5622, and
6136 unique features, respectively, on the basis of 10-fold CV
(Table 3). Table 4 lists the top 20 features ranked by their p-values
with corresponding IGs, when d_max = 150.

Classifier performances 

We implemented our own 10-fold CV SVM with the Python
programming language on the basis of LIBSVM. Using the
p < 0.01 threshold, models were constructed based on different
combinations of maximum distances and kernels (polynomial,
RBF, sigmoid, and linear kernels), with n-grams of n = 4-9. The
RBF kernel provided the best performances regardless of
measurement metrics used. In our study, the best accuracy
(0.9602), precision (0.8319), and F1 (0.6288) values were obtained
with d_max = 150. The best recall (0.5215), ROC-AUC (0.7739), and
PR-AUC (0.5879) values were obtained with d_max = 100, 75, and
200, respectively (Table 3).

Figure 1a shows the response of accuracy, precision, recall, and
F1 values versus d_max. Accuracy fails to provide adequate
information on evaluating the minority samples (TFTGs).
Combining different evaluation metrics tends to provide a comprehensive
assessment of classification on imbalanced datasets. When
d_max = 25, our model suffered from a low recall rate because of its
poor accuracy when classifying the positive samples (TFTGs).
Starting from d_max = 50, a boost in performances was observed in
recall and F1 values as a result of increased accuracy in predicting
positive samples. At d_max = 150, the model reaches the highest
accuracy value of 0.9602. The best precision value is 0.8319 when
d_max = 150 and declines a little as d_max increases. Starting from
d_max = 100, recall values are above 0.5 and peak at d_max = 100 with
a value of 0.5215. Likewise, the F1 scores are above 0.6 when d_max
is greater than or equal to 100 and reach the greatest value of 0.6288
when d_max = 150. Figure 1b shows an AUC-versus-d_max curve based
on both the ROC curve and the PR curve. The ROC-AUC value
arrives at 0.7739 when d_max = 75. A slight decrease is detected
when d_max > 100. Furthermore, by varying d_max the PR-AUC value
responds more obviously than the ROC-AUC value. Beginning at
d_max = 100, our model produces above-0.56 PR-AUC values,
which gradually decrease from d_max = 100 to d_max = 125 and peak
at 0.5879 when d_max = 200. Overall, a d_max setting of 150 is likely to
give a superior performance with limited complexity of computation
compared to other maximum distance settings. Although it
fails to produce the optimal AUC values for either ROC or PR
curve, it provides the best accuracy, precision, and F1 values.
Considering the fact that most d_max value settings perform well in
detecting the correct negative samples (non-TFTGs), the model
that is most capable of identifying the correct positive samples
(TFTGs; high precision value) yields the best results. A maximum
distance d_max = 200 provides great PR-AUC values. However, it
adds great computational cost for feature generation and selection
processes (possibly $50 \times \sum_{i=4}^{9} 4^i = 17\,472\,000$ more features) and
some of its performance metrics are even worse than d_max = 150.
Therefore, we conclude that features within a maximum distance
d_max = 150 around central TFBSs contain sufficient information for
making accurate prediction on TFTGs.

In order to show the advantages of our proposed RCDSNGP
algorithm compared to other algorithms [12], we also examined
the ROC, PR, and cost curves as well as precision, recall, and F1
values generated by different algorithms [12] based on the same
10-fold CV split. Additionally, comparisons were made among
classifiers with different kernels. Figure 2a indicates that the
RCDSNGP-based model with RBF kernel outperformed all other
models because it generates a curve that is closer to the perfect
classification point (0,1) in the ROC curve compared with all
others. Interestingly, the polynomial kernel produced poor results
Table 4. The top 20 smallest p-value reverse-complementary
distance-sensitive n-grams (RCDSNGs; n = 4-9) selected as
representative features with their information gain (IG) values
and p-values in the 1000-bp upstream of 186 transcription
factor target genes (TFTGs) and 2601 non-TFTGs, when the
maximum distance (half of the window size) on each flank of
the central transcription factor binding site d_max = 150.

| Ranking | RCDSNG (feature) | IG value | p-value |
|---|---|---|---|
| 1 | {ACACGT, ACGTGT, 4} | 0.001885 | 0 |
| 2 | {CGAGAA, TTCTCG, 82} | 0.001884 | 0 |
| 3 | {AATATAA, TTATATT, 52} | 0.001884 | 0 |
| 4 | {ACTTCC, GGAAGT, 30} | 0.001880 | 0 |
| 5 | {ACACC, GGTGT, 44} | 0.001880 | 0 |
| 6 | {GTAC, GTAC, 39} | 0.001701 | 0 |
| 7 | {CAAACA, TGTTTG, 149} | 0.001707 | 0 |
| 8 | {AAAAATA, TATTTTT, 44} | 0.001707 | 0 |
| 9 | {AGTAT, ATACT, 51} | 0.001713 | 0 |
| 10 | {ATGATTA, TAATCAT, 130} | 0.001656 | 0 |
| 11 | {ACTTC, GAAGT, 30} | 0.001516 | 0 |
| 12 | {CTAAC, GTTAG, 91} | 0.001476 | 0 |
| 13 | {ACAAATA, TATTTGT, 71} | 0.001463 | 0 |
| 14 | {ATACG, CGTAT, 49} | 0.001463 | 0 |
| 15 | {AAAACC, GGTTTT, 75} | 0.001463 | 0 |
| 16 | {AAAGACA, TGTCTTT, 117} | 0.001463 | 0 |
| 17 | {TAAAACA, TGTTTTA, 85} | 0.001463 | 0 |
| 18 | {AGTATA, TATACT, 124} | 0.001458 | 0 |
| 19 | {AATGTG, CACATT, 43} | 0.001412 | 0 |
| 20 | {ATACCC, GGGTAT, 16} | 0.001412 | 0 |
because it predicted each sample with a fixed score of -1.
Classifiers that dominate in ROC space should dominate in PR
space as well [33]. This is vividly presented by Figure 2b.
Furthermore, our study highlighted the drawbacks of over-dependency
on the ROC-AUC value when evaluating the
classification performances. Although the difference in ROC-AUC
values was relatively small (0.7626 versus 0.5055) between
RCDSNGP and RCPSNP algorithms, the difference in PR-AUC
values was remarkable (0.569 versus 0.0773).

Finally, we evaluated the cost curve of each model. The cost
curve, which uses the expected misclassification cost as the
performance measure, was proposed to better visualize the
misclassification cost and to compare performances on the basis of
statistical significance [32]. Each data point in the ROC curve is
mapped to a distinct straight line (cost line) in the cost curve by
connecting point (0, FP) and point (1, FN), where FP and FN are
the false-positive and false-negative rates at that point. Multiple
points in the ROC space generate several cost lines that form a
lower envelope, which is shown in Figure 2c. The x-axis, denoted as
probability cost, covers all possible proportions of positive
samples (0 to 1). It represents the proportion of positive samples
when the classifier is deployed. Thus, at x = 0 (no positive samples),
the only possible misclassification errors are FPs. Likewise, at x = 1
(no negative samples), the only possible misclassification errors are
FNs. The straight line connecting these two points represents the
trend of misclassification cost as the percentage of positive samples
varies. The lower envelope generated by a non-discrete classifier
such as SVM or Artificial Neural Network is the counterpart of
the upper convex hull of the ROC curve. At each probability cost
value, the closer the curve is to the x-axis, the better the classifier
performs (a lower expected cost). As presented in Figure 2c, the
RCDSNGP-based model with RBF kernel has the lowest cost
from 0 to 68 percent of positive samples. Additionally, it is
verified to be significantly different (p < 0.05) from the
RCPSNP-based model within the 4.8 to 65 percent range of positive
samples using the method proposed by Drummond and Holte [32]. In
our study, the dataset contains 6.7 percent of positive samples,
which is within the 4.8 to 59 percent range; thus, our model
outperformed the RCPSNP-based model in terms of expected cost.

Figure 1. Performances of transcription factor target gene
prediction affected by the maximum distance on each flank
from the binding site (d_max) based on 10-fold cross-validation
using reverse-complementary distance-sensitive n-gram profile
algorithm with n = 4-9 and support vector machine with
Gaussian radial basis function kernel. (A) Performance evaluation
metric (accuracy, precision, recall, and F1) values versus d_max on each
flank from the central binding site. (B) Area under the curve (AUC) value
of receiver operating characteristic (ROC) curve and precision-recall (PR)
curve versus d_max on each flank from the central binding site.
doi:10.1371/journal.pone.0094519.g001
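The mapping from ROC points to cost lines and their lower envelope can be sketched as follows. This is a minimal illustration of the construction described by Drummond and Holte [32], not the code used in the study; the ROC operating points below are hypothetical, and the names cost_line and lower_envelope are chosen here for illustration.

```python
def cost_line(fp_rate, fn_rate, x):
    """Expected cost at probability cost x for one ROC operating point.

    The line runs from (0, fp_rate) to (1, fn_rate)."""
    return fp_rate * (1.0 - x) + fn_rate * x


def lower_envelope(roc_points, xs):
    """Minimum over all cost lines at each x; roc_points holds (FPR, TPR)."""
    return [min(cost_line(fpr, 1.0 - tpr, x) for fpr, tpr in roc_points)
            for x in xs]


# Hypothetical ROC operating points (FPR, TPR), including the two trivial
# classifiers "always negative" (0, 0) and "always positive" (1, 1).
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]
xs = [i / 10 for i in range(11)]
envelope = lower_envelope(roc_points, xs)
print([round(v, 3) for v in envelope])
```

Because the two trivial classifiers are included, the envelope never rises above min(x, 1 - x). A model is useful at a given class proportion only where its envelope lies below that triangle.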

To summarize, using new feature generation and selection
strategies to predict TFTGs of ARFs in A. thaliana based on
published datasets [12], we substantially increased classification
performance. Our best result was obtained when d_max = 150 using
the RBF kernel based on an average of 2395 features. We adopted
the ROC measure for efficacy evaluation and obtained an ROC-AUC
value of 0.7626 (SE = 0.0021), accuracy value of 0.9602
(SE = 0.0023), precision value of 0.8319 (SE = 0.0331), recall value
of 0.5054 (SE = 0.0231), and F1 score of 0.6288 (SE = 0.0211;
Table 3), which were higher than the best result reported by Dai et
al. [12] (ROC-AUC = 0.73, accuracy = 0.69, precision = 0.3684,
recall = 0.1129, and F1 = 0.1728) based on the same dataset but
with a different 10-fold CV split.

Figure 2. Classification performances using the optimal
maximum distance on each flank from the binding site
(d_max = 150). (A) Receiver operating characteristic (ROC) curve, (B)
precision-recall (PR) curve, and (C) cost curve of the 10-fold
cross-validation on transcription factor target gene prediction using
reverse-complementary distance-sensitive n-gram profile (RCDSNGP)
algorithm with d_max = 150 and n = 4-9 based on different support
vector machine (SVM) kernels, reverse-complementary position-sensitive
n-gram profile (RCPSNP) algorithm using linear-kernel SVM, and
Position-Specific Scoring Matrices (PSSM)-based approach.
doi:10.1371/journal.pone.0094519.g002
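The F1 scores quoted in the summary follow from the reported precision and recall values as their harmonic mean. The short check below is only an arithmetic illustration; the helper name f1_score and the dictionary keys are chosen here, and the inputs are the rounded values given in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the text.
reported = {
    "RCDSNGP (this study)": (0.8319, 0.5054),
    "Dai et al. [12]": (0.3684, 0.1129),
}
f1_values = {name: round(f1_score(p, r), 4) for name, (p, r) in reported.items()}
print(f1_values)
```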



Additionally, Dai et al. [12] reported the performance of a
Position-Specific Scoring Matrices (PSSM)-based approach using
the cluster-buster algorithm [34], which only yielded an ROC-AUC
value of 0.51. We implemented a traditional approach based
on the position frequency matrix method using a feature
encoding algorithm similar to that explained in Youn et al. [26].
Each sequence was parsed according to a d_max = 150 setting and the
sequence conservation was evaluated using a 4 (four nucleotides) by
300 (2 × d_max) position frequency matrix (150 bp flanking each side
of the primary conserved motif). At each residue (1 out of 300), we
considered a 20-bp window (left: 10, right: 10) to construct the
frequency count for each nucleotide. Standard SVM-based
training and testing was performed on the generated PSSMs.
Likewise, the algorithm only provided an ROC-AUC value of
0.5569 and a PR-AUC value of 0.0801 using the same 10-fold CV
split as our RCDSNGP algorithm (Table 5). The performance
curves based on PSSM are also presented in Fig. 2.
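One way to read the windowed frequency encoding described above is sketched below. It is an interpretation made for illustration, not the implementation used in the study. It assumes that the two 150-bp flanks are concatenated into one 300-bp string, that the window covers 10 positions on each side and excludes the centre position, and that windows are clipped at the sequence ends. The function name and the toy sequence are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def windowed_frequency_matrix(flanks, half_window=10):
    """Return a 4 x len(flanks) matrix of windowed nucleotide counts.

    Column i counts each nucleotide in the half_window positions to the left
    and to the right of position i (centre excluded, clipped at the ends)."""
    matrix = [[0] * len(flanks) for _ in NUCLEOTIDES]
    for i in range(len(flanks)):
        left = flanks[max(0, i - half_window):i]
        right = flanks[i + 1:i + 1 + half_window]
        for base in left + right:
            row = NUCLEOTIDES.find(base)
            if row >= 0:  # skip ambiguous symbols such as N
                matrix[row][i] += 1
    return matrix


# Hypothetical 300-bp input: 150 bp from each flank of the core motif.
flanks = "ACGT" * 75
pfm = windowed_frequency_matrix(flanks)
print(len(pfm), len(pfm[0]), [pfm[r][150] for r in range(4)])
```

Flattening the 4 x 300 matrix row by row would give one fixed-length vector per sequence that an SVM can consume.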

Furthermore, we also implemented the RCPSNP algorithm
proposed by Dai et al. [12] using the optimal settings (e.g., n = 4-9,
linear-kernel SVM) and applied it to the same 10-fold CV split as
our RCDSNGP algorithm, which yielded an ROC-AUC value of
0.5055, PR-AUC value of 0.0773, accuracy value of 0.9300,
precision value of 0.2000, recall value of 0.0161, and F1 score of
0.0299 (Table 5). Our classifier generated points much closer to
the perfect classification point (0,1) in the ROC curve than those
generated by the RCPSNP algorithm [12] (Fig. 2a). Most importantly,
considering that traditional metrics for measuring classification
performance tend to provide misleading or inadequate information
on imbalanced datasets, we also evaluated other metrics
and their corresponding curves, such as PR and cost curves, which
showed greatly improved results as well.

The detailed model files, 10-fold CV datasets represented as
matrices, and classification results are available at the
supplementary online data source.

Discussion 

Understanding the mechanism of gene regulatory networks is a
challenging task. As of today, there is still much uncertainty in
identifying the corresponding TFBSs and TFTGs. More TF and
TF-dependent target gene regulation studies are required to
evaluate the biological function and mechanism of more gene
regulation players. The activity and affinity of a TF would be the
ultimate balanced result of the various checkpoints of biological
regulation. The binding efficiency of a TF to its corresponding
TFBS is regulated by various factors, including TF synthesis,
ligand binding to the TFs, and DNA binding mechanism through
post-translational modifications such as phosphorylation and
glycosylation of the TFs. In addition, the DNA binding process,
dimerization, and interactions with cofactors for the functional
complex formation are important parameters controlling the TF
activity [35,36]. As more information on the interplay among a TF,
its corresponding TFBS, and TFTG accumulates, it could become
possible to understand the precise TFTG expression affected by
different TFs. A number of computational approaches that rely on
well-known TFBSs have been proposed, but a majority of these
algorithms suffered from high FP rates [12,37]. Therefore, much
effort was put into reducing FP rates and increasing prediction
accuracies [12], whereas the importance of the dataset structure
was ignored. In our study, only 186 out of 2787 genes that all
contain the binding site ('TGTCTC' or 'GAGACA') in their 1000-bp
upstream region were TFTGs. If our model correctly predicted
all negative samples (non-TFTGs) and mispredicted all positive
samples (TFTGs), it would still yield an accuracy value of 0.9333
(2601/2787) but with precision value undefined and zero values
for both recall and F1 score. Therefore, minimizing the FP rate or
maximizing the accuracy contributes little to improving overall
performance when analyzing a highly skewed dataset. It is
important to analyze different evaluation metrics to better assess
the classification performance.

Table 5. Classification performances [evaluated as accuracy, precision, recall, F1, area under the curve (AUC) value of receiver
operating characteristic (ROC) curve, and AUC value of precision-recall (PR) curve] using different feature encoding algorithms with
optimal parameter settings for SVM and d_max = 150, including reverse-complementary distance-sensitive n-gram profile
(RCDSNGP), reverse-complementary position-sensitive n-gram profile (RCPSNP), and Position-Specific Scoring Matrices (PSSM)-
based algorithms.

Feature Encoding Algorithm  Accuracy  Precision  Recall  F1      ROC-AUC  PR-AUC
RCDSNGP                     0.9602    0.8319     0.5054  0.6288  0.7626   0.5690
RCPSNP                      0.9300    0.2000     0.0161  0.0299  0.5055   0.0773
PSSM                        0.9332    NaN        0       0       0.5569   0.0801

doi:10.1371/journal.pone.0094519.t005
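The accuracy obtained by labelling every gene as a non-TFTG can be checked directly from the class sizes given in the Discussion. The helper below is a small illustrative sketch (the name confusion_metrics is chosen here). It returns None where a metric is undefined because its denominator is zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts.

    Precision or recall is None when its denominator is zero."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


positives, total = 186, 2787      # TFTGs and all genes, as stated in the text
negatives = total - positives     # 2601 non-TFTGs

# A classifier that predicts "negative" for every gene.
accuracy, precision, recall = confusion_metrics(
    tp=0, fp=0, fn=positives, tn=negatives)
print(round(accuracy, 4), precision, recall)
```

The high accuracy alongside zero recall is the reason accuracy alone is a poor guide on this dataset.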

We deployed a novel feature extraction method (RCDSNGP)
that incorporated a relative distance parameter into each feature
to account for the positional information of each motif relative to
the central TFBS. For feature selection, we adopted a Monte Carlo
simulation-based statistical approach rather than arbitrarily
choosing thresholds. We compared our results with the
RCPSNP-based approach [12] on the same dataset. Our best
model achieved an accuracy of 0.9602 and an ROC-AUC value of
0.7626 when d_max = 150 compared with 0.69 and 0.73 reported by
Dai et al. [12], respectively. Dai et al. introduced three parameters
for constructing RCPSNP: the number of n-grams C
(analogous to our maximum distance d_max), the top F representative
features based on IG, and a position-sensitive factor P (identical
n-grams located within a P-bp region neighboring the central
binding site are counted equally). Their best result was obtained
when n = 4-9, C = 100, P = 100, and F = 1000. Their detailed
results containing prediction scores can be found in their
supplementary web data [12]. Moreover, little positional
information is considered when C equals P. In other words, RCPSNP
behaves almost the same as RCNP [11] when C and P hold the
same value. The significant performance increase based on our
RCDSNGP algorithm indicated that ARFs function by recognizing
multiple consensus motifs that might be co-occurring TFBSs or
subsequences functioning coordinately. More importantly, the
relative distance from each motif to the binding site appears to
play an important role in the gene regulation process. The
structural complexities of protein and DNA may result in a type of
mutual recognition that relies more on the distance from the
conserved motif to the TFBS, regardless of where the motif lies
(downstream or upstream of the central TFBS). The PSSM-based
approaches may be useful for TFTG identification when more
associated TFBSs are known.
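The idea of a reverse-complementary, distance-sensitive n-gram feature can be sketched as follows. This is a simplified reading for illustration only, not the published algorithm. It assumes that an n-gram and its reverse complement are merged into one feature, that distance is the number of bases between the n-gram and the nearest edge of the binding site, that both flanks are treated identically, and that only presence (not counts) is recorded. The function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def rcdsng_features(upstream, site_start, site_len=6,
                    n_values=range(4, 10), d_max=150):
    """Collect (n-gram, reverse complement, distance) triples around a site."""
    site_end = site_start + site_len
    features = set()
    for n in n_values:
        for start in range(len(upstream) - n + 1):
            end = start + n
            if end <= site_start:        # n-gram lies on the left flank
                distance = site_start - end
            elif start >= site_end:      # n-gram lies on the right flank
                distance = start - site_end
            else:                        # overlaps the binding site itself
                continue
            if distance >= d_max:
                continue
            gram = upstream[start:end]
            rc = reverse_complement(gram)
            features.add((min(gram, rc), max(gram, rc), distance))
    return features


# Toy sequence: the core motif TGTCTC with ACACGT four bases to its left and
# the reverse complement ACGTGT four bases to its right.
upstream = "ACACGT" + "AAAA" + "TGTCTC" + "CCCC" + "ACGTGT"
features = rcdsng_features(upstream, site_start=10)
print(("ACACGT", "ACGTGT", 4) in features)
```

In this toy case both flanking hexamers collapse into the single feature {ACACGT, ACGTGT, 4}, which has the same form as the top-ranked entry in the feature table.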

Identifying patterns of other potential TFBSs and the binding
property of ARFs by enumerating all possible n-grams is
computationally expensive. The complexity becomes even
greater when a distance parameter is included. Therefore, better
feature selection methods become necessary. We employed a
systematic statistical approach. For a given feature, the two
classes of samples are considered different from each other if, and
only if, the probability of obtaining an IG value bigger than the
original one is below a certain level (p-value). This probability
value is obtained using Monte Carlo simulations [38]. We verified
that our feature selection algorithm is robust for a range of
p-value thresholds (between 0.005 and 0.01). However, when the
p-value threshold becomes bigger, the number of features increases
drastically, which greatly increases computational cost. Moreover,
we also evaluated the number of important features that have a
p-value smaller than the 0.01 threshold versus different d_max
values (data not shown). The slope of the curve reached its maximum
value between d_max = 25 and d_max = 50 and began to dwindle when
d_max > 50, suggesting that flank regions closer to the core motif
contain more important features for predicting TFTGs.
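A simple way to estimate such a p-value is a label-permutation test, sketched below. This is a generic stand-in chosen for illustration and is not necessarily the Monte Carlo procedure of [38] used in the study. It assumes binary features and binary class labels, and it reports the fraction of shuffles whose IG is strictly larger than the observed IG. All names, the iteration count, and the toy data are hypothetical.

```python
import math
import random


def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(feature, labels):
    """IG of a binary feature with respect to binary class labels."""
    base = entropy([labels.count(0), labels.count(1)])
    conditional = 0.0
    for value in (0, 1):
        subset = [y for f, y in zip(feature, labels) if f == value]
        if subset:
            conditional += len(subset) / len(labels) * entropy(
                [subset.count(0), subset.count(1)])
    return base - conditional


def monte_carlo_p_value(feature, labels, n_iter=1000, seed=0):
    """Fraction of label shuffles whose IG exceeds the observed IG."""
    rng = random.Random(seed)
    observed = information_gain(feature, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)
        if information_gain(feature, shuffled) > observed:
            exceed += 1
    return exceed / n_iter


# Toy data: 10 positives among 40 samples and a feature that matches the labels.
labels = [1] * 10 + [0] * 30
perfect_feature = list(labels)
p_perfect = monte_carlo_p_value(perfect_feature, labels, n_iter=200)
print(p_perfect)
```

A feature would be kept when its estimated p-value falls below the chosen threshold (for example 0.01). With a finite number of shuffles, a reported p-value of 0 only means that no shuffle exceeded the observed IG.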

The precision-recall curve is used in information retrieval as an
alternative to the ROC curve when analyzing unbalanced datasets
[33]. Optimal prediction models tend to generate curves close to
the upper-left corner in the ROC curve and the upper-right corner in
the PR curve. Likewise, the cost curve was introduced to measure
performance by varying class probabilities and to generate
confidence intervals [32]. Regardless of which curve was used,
RCDSNGP-based approaches using the RBF kernel demonstrated
significant advantages over the RCPSNP-based approach [12]. The
polynomial kernel yielded much poorer performance than the
other kernels. Cost curves confirmed these effects by showing that
superior models generate a lower envelope curve than
inferior ones. In other words, superior models have
significantly lower misclassification cost within a certain
percentage range of positive samples.

Taken together, the RCDSNGP algorithm combined with
statistical feature selection methods provides an efficient and
highly accurate way to predict TFTGs on the basis of well-studied
TFBSs. We believe that this improved methodology can be
employed when analyzing other species besides A. thaliana. It might
also provide new insights into the understanding of gene
regulatory networks.

Acknowledgments 

We thank the associate editor and two anonymous reviewers for their
valuable comments and constructive suggestions on the earlier draft of the
manuscript.

Supplementary Data 

Online supplementary data are available at
http://redwood.cs.ttu.edu/~euyoun/TFTG.html.

Author Contributions 

Conceived and designed the experiments: SC EY JL SM. Performed the
experiments: SC EY. Analyzed the data: SC EY JL. Contributed
reagents/materials/analysis tools: SC EY. Wrote the paper: SC EY JL SM.



PLOS ONE | www.plosone.org 






April 2014 | Volume 9 | Issue 4 | e94519 






References 

1. Sinha S, Tompa M (2002) Discovery of novel transcription factor binding sites
by statistical overrepresentation. Nucleic Acids Res 30: 5549-5560.

2. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, et al. (2001)
Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured
mammalian cells. Nature 411: 494-498.

3. Ruvkun G (2001) Molecular biology. Glimpses of a tiny RNA world. Science
294: 797-799.

4. Lawrence CE, Altschul S, Boguski M, Liu J, Neuwald A, et al. (1993) Detecting
subtle sequence signals: a Gibbs sampling strategy for multiple alignment.
Science 262: 208-214.

5. Hughes JD, Estep PW, Tavazoie S, Church GM (2000) Computational
identification of cis-regulatory elements associated with groups of functionally
related genes in Saccharomyces cerevisiae. J Mol Biol 296: 1205-1214.

6. Robison K, McGuire AM, Church GM (1998) A comprehensive library of
DNA-binding site matrices for 55 proteins applied to the complete Escherichia
coli K-12 genome. J Mol Biol 248: 241-254.

7. McCue LA, Thompson W, Carmack CS, Ryan M, Liu J, et al. (2001)
Phylogenetic footprinting of transcription factor binding sites in proteobacterial
genomes. Nucleic Acids Res 39: 774-782.

8. Stormo GD (2000) DNA binding sites: representation and discovery.
Bioinformatics 16: 16-23.

9. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, et al. (2006)
TRANSFAC and its module TRANSCompel: transcriptional gene regulation in
eukaryotes. Nucleic Acids Res 34: D108-D110.

10. Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, et al. (2003)
MATCH: A tool for searching transcription factor binding sites in DNA
sequences. Nucleic Acids Res 31: 3576-3579.

11. He J, Dai X, Zhao X (2006) A systematic computational approach for
transcription factor target gene prediction. 2006 IEEE Symposium on
Computational Intelligence in Bioinformatics and Computational Biology
(CIBCB 2006) Toronto, Ontario, Canada, pp. 385-391.

12. Dai X, He J, Zhao X (2007) A new systematic computational approach to
predicting target genes of transcription factors. Nucleic Acids Res 35:
4433-4440.

13. Meysman P, Dang TH, Laukens K, Smet RD, Wu Y, et al. (2010) Use of
structural DNA properties for the prediction of transcription-factor binding sites
in Escherichia coli. Nucleic Acids Res 39: e6.

14. Boeva V, Surdez D, Guillon N, Tirode F, Fejes AP, et al. (2010) De novo motif
identification improves the accuracy of predicting transcription factor binding
sites in ChIP-Seq data analysis. Nucleic Acids Res 38: e126.

15. Tompa M, Li N, Bailey TL, Church GM, De Moor B, et al. (2005) Assessing
computational tools for the discovery of transcription factor binding sites. Nat
Biotechnol 23: 137-144.

16. Friberg M, von Rohr P, Gonnet G (2005) Scoring functions for transcription
factor binding site prediction. BMC Bioinform 6: 84.

17. Ulmasov T, Liu ZB, Hagen G, Guilfoyle TJ (1995) Composite structure of auxin
response elements. Plant Cell 7: 1611-1623.

18. Goda H, Sawa S, Asami T, Fujioka S, Shimada Y, et al. (2004) Comprehensive
comparison of auxin-regulated and brassinosteroid-regulated genes in
Arabidopsis. Plant Physiol 134: 1555-1573.

19. Shannon C (1997) A mathematical theory of communication. Bell Syst Tech J
27: 379-423.

20. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for
rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl
Environ Microbiol 73: 5261-5267.

21. Liu ZB, Ulmasov T, Shi X, Hagen G, Guilfoyle TJ (1994) Soybean GH3
promoter contains multiple auxin-inducible elements. Plant Cell 6: 645-657.

22. Youn E, Jeong MK (2009) Class dependent feature scaling method using naive
Bayes classifier for text datamining. Pattern Recognit Lett 30: 477-485.

23. Yang Y, Pedersen JP (1997) A Comparative Study on Feature Selection in Text
Categorization. Proceedings of the Fourteenth International Conference on
Machine Learning. Morgan Kaufmann Publishers Inc., Nashville, TN, USA, pp.
412-420.

24. White JR, Nagarajan N, Pop M (2009) Statistical methods for detecting
differentially abundant features in clinical metagenomic samples. PLoS Comput
Biol 5: e1000352.

25. Pitta DW, Pinchak WE, Dowd SE, Osterstock J, Gontcharova V, et al. (2010)
Rumen bacterial diversity dynamics associated with changing from
bermudagrass hay to grazed winter wheat diets. Microb Ecol 59: 511-522.

26. Youn E, Peters B, Radivojac P, Mooney SD (2007) Evaluation of features for
catalytic residue prediction in novel folds. Protein Sci 16: 216-226.

27. Tzahor S, Aharonovich DM, Kirkup B, Yogev T, Frank IB, et al. (2009) A
supervised learning approach for taxonomic classification of core-photosystem-II
genes and transcripts in the marine environment. BMC Genomics 10: 229.

28. Patil K, Haider P, Pope P, Turnbaugh P, Morrison M, et al. (2011) Taxonomic
metagenome sequence assignment with structured output models. Nat Methods
8: 191-192.

29. Joachims T (1999) Making large-Scale SVM Learning Practical. In: Schölkopf
B, Burges C, Smola A, editors. Advances in Kernel Methods - Support Vector
Learning. Cambridge: MIT Press, pp. 41-56.

30. Chang CC, Lin CJ (2011) LIBSVM: a library for Support Vector Machines.
ACM Trans Intell Syst Technol 2: 1-27.

31. He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans
Knowledge Data Eng 21: 1263-1284.

32. Drummond C, Holte RC (2006) Cost curve: an improved method for visualizing
classifier performance. Mach Learn 65: 95-130.

33. Davis J, Goadrich M (2006) The relationship between precision-recall and ROC
curves. Proceedings of the twenty-third International Conference on Machine
Learning, Pittsburgh, PA, USA, pp. 233-240.

34. Siggers T, Duyzend MH, Reddy J, Khan S, Bulyk ML (2011) Non-DNA-binding
cofactors enhance DNA-binding specificity of a transcriptional
regulatory complex. Mol Syst Biol 7: 555.

35. Stower H (2012) Gene regulation: Resolving transcription factor binding. Nat
Rev Genet 13: 71.

36. Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, et al. (2005)
MatInspector and beyond: promoter analysis based on transcription factor
binding sites. Bioinformatics 21: 2933-2942.

37. Frith MC, Li MC, Weng Z (2003) Cluster-Buster: finding dense clusters of motifs
in DNA sequences. Nucleic Acids Res 31: 3666-3668.

38. Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, et al. (2008)
Monte Carlo feature selection for supervised classification. Bioinformatics 24:
110-117.






